Latent Dirichlet Allocation
In the summer semester of 2016 I attended the seminar Algorithmic Methods in the Humanities at KIT. My talk was about Latent Dirichlet Allocation (LDA); the slides can be found here. The objective of this write-up is to explain what Latent Dirichlet Allocation is, using a concrete example, and to provide some intuition on how it works.
Topics
Topic models are used to discover topics in documents. A topic may be defined as a distribution over words; consider, for instance, the following distribution over words, representing computer science:
0.15*programming, 0.1*algorithm, 0.05*Turing machine, 0.05*complexity, 0.05*software, …
Latent Dirichlet Allocation is one instance of a topic model. LDA’s defining feature is the generative process assumption. In machine learning, generative models are models that can generate data points. Given some data (e.g. a text corpus), a generative model assumes that there exists some process in the background which has generated that data. In the context of topic models, this means that the process first selected a topic for each word position and then decided which word to pick from that topic. The generative process has certain parameters, and it is assumed that there is a configuration of these parameters that is most likely to have generated the observation. The goal is to infer those hidden parameters. Inference can be seen as reversing the generative process: we want to uncover the most likely configuration that has produced a document. In the case of Latent Dirichlet Allocation the hidden variables are the per-document topic distributions, the per-document per-word topic assignments, and the topics themselves.
Generative Process Pseudocode:
In: number of documents \(D\), number of words per document \(N\), number of topics \(K\), vocabulary size \(V\), Dirichlet parameters \(\alpha \in \mathbb{R}_{> 0}^K\), \(\beta \in \mathbb{R}_{> 0}^V\)
Out: \(D\) documents (bag-of-words) with \(N\) words each
- For each topic \(k\): choose \(\varphi_k \,\sim\, \mathrm{Dir}(\beta)\)
- For each document \(d\): choose \(\theta_d \,\sim\, \mathrm{Dir}(\alpha)\)
- For each word position \((d,n)\):
  - Choose a topic \(z_{d,n} \,\sim\, \mathrm{Multinomial}(\theta_d)\)
  - Choose a word \(w_{d,n} \,\sim\, \mathrm{Multinomial}(\varphi_{z_{d,n}})\)

where \(d \in \{1,\ldots,D\}\), \(n \in \{1,\ldots,N\}\) and \(k \in \{1,\ldots,K\}\).
\(\theta_d\), the topic distribution of document \(d\), and \(\varphi_k\), the word distribution of topic \(k\), are both drawn from the so-called Dirichlet distribution. A discussion of this distribution and its properties is omitted here, since it would add little to the intuition.
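To make the generative process concrete, here is a toy simulation of it in Python. This is a minimal sketch with made-up values for \(K\), \(N\), \(D\) and the vocabulary; it is not part of the original notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

vocabulary = ["programming", "algorithm", "software", "mind", "ethics", "logic"]
V = len(vocabulary)      # vocabulary size
K = 2                    # number of topics
D = 3                    # number of documents
N = 8                    # words per document
alpha = np.full(K, 0.5)  # Dirichlet parameter for the per-document topic distributions
beta = np.full(V, 0.1)   # Dirichlet parameter for the per-topic word distributions

# Choose phi_k ~ Dir(beta) for every topic and theta_d ~ Dir(alpha) for every document.
phi = rng.dirichlet(beta, size=K)     # shape (K, V)
theta = rng.dirichlet(alpha, size=D)  # shape (D, K)

documents = []
for d in range(D):
    words = []
    for n in range(N):
        z = rng.choice(K, p=theta[d])  # choose a topic z_{d,n} ~ Multinomial(theta_d)
        w = rng.choice(V, p=phi[z])    # choose a word w_{d,n} ~ Multinomial(phi_{z_{d,n}})
        words.append(vocabulary[w])
    documents.append(words)

print(documents)
```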
The idea of the following code is simple: Wikipedia articles from three topics, namely computer science, philosophy and linguistics, are put into one corpus. The second step is to let LDA reconstruct those three topics. (The complete notebook can be found here.)
The example makes use of gensim, a topic modelling library.
First, scrape some Wikipedia articles.
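The fetching code is not reproduced from the notebook; a minimal sketch, assuming the third-party wikipedia package and a hand-picked list of article titles (the actual titles used may differ), could look like this:

```python
import wikipedia

# A few articles per intended topic; the exact titles used in the notebook may differ.
titles = [
    "Computer science", "Algorithm", "Programming language",  # computer science
    "Philosophy", "Ethics", "Pythagoras",                     # philosophy
    "Linguistics", "Syntax", "Phonology",                     # linguistics
]

# Download the plain-text content of every article.
raw_texts = [wikipedia.page(title).content for title in titles]
```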
The following functions are used for preprocessing: the texts are tokenized, and STOPWORDS like of, the, and, or, this, that, over, … are removed, as they don’t convey much information.
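Continuing the sketch, a minimal preprocessing step with gensim’s simple_preprocess and STOPWORDS might look as follows (not necessarily the notebook’s exact code):

```python
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from gensim.corpora import Dictionary

def preprocess(text):
    """Tokenize, lowercase and remove stopwords and very short tokens."""
    return [token for token in simple_preprocess(text)
            if token not in STOPWORDS and len(token) > 2]

tokenized = [preprocess(text) for text in raw_texts]

# Map each token to an integer id and represent every document as a bag of words.
dictionary = Dictionary(tokenized)
corpus = [dictionary.doc2bow(tokens) for tokens in tokenized]
```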
The number of topics has to be chosen a priori. One can think of LDA as a clustering algorithm. Like K-Means, LDA needs the number of ‘clusters’ (i.e. topics) it is supposed to find in the data.
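Continuing the sketch, training the gensim model on the bag-of-words corpus is a single call; num_topics is set to 3 to match the three Wikipedia categories, while passes and random_state are assumptions for reproducibility, not necessarily the notebook’s settings:

```python
from gensim.models import LdaModel

K = 3  # number of topics, chosen a priori

lda = LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=K,
    passes=10,       # several passes over the small corpus
    random_state=1,  # make the run reproducible
)
```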
LDA produces the following three topics and seems to correctly reproduce the three clusters. Notice that, in our dataset, the most important word for topic 1 is programming, and even this word accounts for only about 1% of the topic’s probability mass.
Only the most prevalent words per topic are listed in the following.
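With the trained model from the sketch above, the topics can be printed like this (the actual topic listing is omitted here):

```python
# Print the ten most prevalent words of each topic together with their weights.
for topic_id, words in lda.show_topics(num_topics=K, num_words=10, formatted=True):
    print(topic_id, words)
```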
We now have a trained model and can query it with a new instance. This means that we can use the model to calculate how an unseen document (i.e. a document that has not been used to train the model) is composed of the previously determined topics.
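In the sketch this amounts to preprocessing the unseen article exactly like the training documents and passing its bag-of-words representation to the model, which returns a list of (topic id, proportion) tuples:

```python
# Fetch and preprocess a document that was not part of the training corpus.
unseen = wikipedia.page("Noam Chomsky").content
bow = dictionary.doc2bow(preprocess(unseen))

# The result is a list of (topic id, proportion) tuples: the topic mixture of the article.
print(lda[bow])
```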
The value 0.66235435896591133 inside the last tuple means that, according to the model, the Wikipedia article on Noam Chomsky is about 66.24% linguistics. Since Noam Chomsky is primarily a linguist, LDA matches our expectation.
Figure 1: Topic proportions for some of the Wikipedia articles that were used to train the model. The Wikipedia article on ‘Pythagoras’, for instance, is classified as a document in the category ‘philosophy’ with very high confidence. Note that although the confidence is very high, it is not 100%: under the model every document exhibits all \(K\) topics, but to very different degrees.
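A figure like this can be produced from the per-document topic mixtures, for example with get_document_topics and matplotlib (again a sketch building on the code above, not the code behind the original figure):

```python
import matplotlib.pyplot as plt
import numpy as np

# Topic proportions for every training document.
proportions = np.zeros((len(corpus), K))
for d, bow in enumerate(corpus):
    for k, prob in lda.get_document_topics(bow, minimum_probability=0.0):
        proportions[d, k] = prob

# Stacked bar chart: one bar per article, one segment per topic.
bottom = np.zeros(len(corpus))
for k in range(K):
    plt.bar(range(len(corpus)), proportions[:, k], bottom=bottom, label=f"topic {k}")
    bottom += proportions[:, k]
plt.xticks(range(len(corpus)), titles, rotation=90)
plt.ylabel("topic proportion")
plt.legend()
plt.tight_layout()
plt.show()
```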
LDA can also be viewed from a dimensionality reduction perspective: “It performs dimensionality reduction, relating each document to a position in low-dimensional ‘topic’ space. In this sense, it is analogous to PCA, except that it is explicitly designed for and works on discrete data.” [1]
How it works
The goal is to infer hidden or latent variables (topics) from observed ones (words). We call the \(n\)-th word from the \(d\)-th document \(w_{d,n}\). Remember that we assumed that all \(w_{d,n}\) come from a generative process/model. This generative model defines several dependencies, which can be depicted with plate notation. This notation is a concise method for visualizing the factorization of the joint probability. A shaded node means that the corresponding random variable is observed.
Figure 2: Plate notation as a way to summarize conditional dependencies
The joint probability of the probabilistic graphical model depicted above is

\[P[X_1,\ldots,X_n] = \prod_{i=1}^n P[X_i \,|\, \mathrm{parent}(X_i)] = \prod_{i=1}^n P[X_i \,|\, Y]\]

Latent Dirichlet Allocation is a hierarchical Bayesian model, and plate notation gives a concise way to represent its hierarchical structure. Be aware of the nested structure: the \(D\) plate contains the \(N\) plate, which means that the topic assignments \(z_{d,n}\) and the corresponding words \(w_{d,n}\) would be repeated \(N \times D\) times if the plate notation were “unfolded”.
Figure 3: LDA represented in plate notation
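Applying this factorization to the LDA plates in Figure 3 gives the joint distribution of all hidden and observed variables (this is the smoothed LDA model of Blei et al. 2003, matching the generative process above):

\[p(\varphi_{1:K}, \theta_{1:D}, z, w \,|\, \alpha, \beta) = \prod_{k=1}^{K} p(\varphi_k \,|\, \beta) \; \prod_{d=1}^{D} \Big( p(\theta_d \,|\, \alpha) \prod_{n=1}^{N} p(z_{d,n} \,|\, \theta_d)\, p(w_{d,n} \,|\, \varphi_{z_{d,n}}) \Big)\]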
In order to get the topics and the assignment of each word \(w_{d,n}\) to a topic (\(z_{d,n}\)), the inferential problem must be solved. This means that we use the conditional dependencies to compute the posterior of the latent variables. With Bayes’ theorem:
\[p(\theta, z | w, \alpha, \beta) = \frac{p(\theta, z, w | \alpha, \beta)}{p(w | \alpha, \beta)}\]

However, as it turns out, this posterior is intractable to compute, which is quite common for posterior distributions: evaluating the denominator of the above equation would involve integrating over all possible hidden topic structures. [2]
One solution to this problem is Gibbs sampling. Sampling is a way to approximate a solution when exact computation is infeasible. Let \(m_{d,k}\) denote the number of words that are assigned to topic \(k\) in document \(d\) and \(m_{k,w}\) the number of times word \(w\) is assigned to topic \(k\) over the whole corpus. First, every word \(w_{d,n}\) is randomly assigned to one of the \(K\) topics. Then, in each iteration, Gibbs sampling picks a word and reassigns it to one of the \(K\) topics. This decision is made by evaluating, for each topic \(k \in \{1,\ldots,K\}\), the probability of the word belonging to that topic. This probability is taken to be proportional to the product of the number of words in the document that are assigned to topic \(k\) and the number of times the sampled word \(w\) was assigned to topic \(k\) over the corpus:
\[m_{d,k} \times m_{k,w}\]

This rule ensures that words from the same document have a higher probability of being assigned to the same topic.
Example: Let’s say there are ten documents with 1,000 words each, and the word Pythagoras occurs 20 times across those documents. We choose \(K=3\). In the initialization step this word was randomly assigned ten times to topic 1, six times to topic 2 and four times to topic 3.
Now we consider one of the ten documents \(d\) and assume that its topic distribution \(\theta_d\) is approximately uniform, meaning that roughly 333 of its words are assigned to each of the three topics. If the Gibbs sampler now comes across the word Pythagoras, it will most likely (re)assign it to topic 1: since the document counts \(m_{d,k}\) are roughly equal, the corpus-wide counts \(m_{k,w}\) dominate the product, and they are highest for topic 1 (\(333 \times 10 > 333 \times 6 > 333 \times 4\)). The counts are then updated; if Pythagoras was previously assigned to topic 2, the word counts would look like this:
word | \(m_{1,w}\) | \(m_{2,w}\) | \(m_{3,w}\)
---|---|---|---
Pythagoras | 11 | 5 | 4
… | … | … | …
Table 1: Word counts after reassigning Pythagoras to topic 1. \(m_{k,w}\) is the number of times word \(w\) has been assigned to topic \(k\) over the whole corpus.
This procedure of (re)assigning words is repeated many times. Gibbs sampling eventually converges, yielding a topic assignment for every word in the corpus. From those assignments one can compute the topic distribution \(\theta_d\) for each document and the word distribution \(\varphi_k\) for each topic.
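As an illustration, here is a toy Gibbs sampler that implements exactly the simplified reassignment rule \(m_{d,k} \times m_{k,w}\) described above; a full collapsed Gibbs sampler for LDA would additionally include smoothing terms derived from \(\alpha\) and \(\beta\) (see Darling 2011 or Griffiths and Steyvers 2004):

```python
import numpy as np

def toy_gibbs(docs, K, iterations=50, seed=0):
    """Toy Gibbs sampler using the simplified score m_{d,k} * m_{k,w} from the text."""
    rng = np.random.default_rng(seed)
    vocab = {w: i for i, w in enumerate(sorted({w for doc in docs for w in doc}))}
    D, V = len(docs), len(vocab)

    m_dk = np.zeros((D, K))                               # words in document d assigned to topic k
    m_kw = np.zeros((K, V))                               # times word w is assigned to topic k (corpus-wide)
    z = [rng.integers(K, size=len(doc)) for doc in docs]  # random initial topic assignments

    for d, doc in enumerate(docs):
        for n, word in enumerate(doc):
            m_dk[d, z[d][n]] += 1
            m_kw[z[d][n], vocab[word]] += 1

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for n, word in enumerate(doc):
                k_old, w = z[d][n], vocab[word]
                m_dk[d, k_old] -= 1                       # remove the word's current assignment
                m_kw[k_old, w] -= 1
                score = (m_dk[d] + 1e-12) * (m_kw[:, w] + 1e-12)  # m_{d,k} * m_{k,w}
                k_new = rng.choice(K, p=score / score.sum())
                z[d][n] = k_new                           # reassign the word and update the counts
                m_dk[d, k_new] += 1
                m_kw[k_new, w] += 1

    theta = m_dk / m_dk.sum(axis=1, keepdims=True)        # per-document topic distributions
    phi = m_kw / m_kw.sum(axis=1, keepdims=True)          # per-topic word distributions
    return theta, phi, z
```

Calling toy_gibbs(tokenized, K=3) on the preprocessed Wikipedia corpus from the sketches above returns normalized counts that play the role of \(\theta_d\) and \(\varphi_k\).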
Conclusion
Topic models are a tool for finding the hidden topic structure in large text corpora. This hidden structure must be uncovered in order to gain information about the low-dimensional structure behind a text corpus. A topic modeling algorithm takes a corpus as input and outputs the topics running through that corpus. The generative process is the essential idea behind Latent Dirichlet Allocation. Reversing the generative process, i.e. inferring the hidden variables, for example by applying Gibbs sampling, yields a solution to the inference problem.
References
- Blei, David M., Andrew Y. Ng, and Michael I. Jordan. “Latent Dirichlet Allocation.” Journal of Machine Learning Research 3 (2003): 993-1022.
- Darling, William M. “A Theoretical and Practical Implementation Tutorial on Topic Modeling and Gibbs Sampling.” Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. 2011.
- Griffiths, Thomas L., and Mark Steyvers. “Finding Scientific Topics.” Proceedings of the National Academy of Sciences 101.suppl 1 (2004): 5228-5235.
Further Reading/Watching
- Blog post: Introduction to Latent Dirichlet Allocation (by Edwin Chen)
- There’s an excellent lecture on Latent Dirichlet Allocation by David Blei: Machine Learning Summer School (MLSS), Cambridge 2009
Footnotes
1. David M. Blei, Andrew Y. Ng, Michael I. Jordan: Latent Dirichlet Allocation. NIPS 2001: 601-608.
2. [Quora] How is computing the posterior distribution in topic models intractable?